Sebastián Pérez Saaibi
This presentation intends to be an introduction to the practical aspects of Machine Learning for students of the Computational Methods Uniandes course. The content below borrows heavily from Coursera’s Data Science Specialization by Jeff Leek, Roger Peng and Brian Caffo (Johns Hopkins University).
# load data
library(kernlab); data(spam); set.seed(333)
# pick a small subset (10 rows) from the spam data set
smallSpam <- spam[sample(dim(spam)[1],size=10),]
# label spam = 2 and ham = 1
spamLabel <- (smallSpam$type=="spam")*1 + 1
# plot the capitalAve values for the dataset with colors differentiated by spam/ham (2 vs 1)
plot(smallSpam$capitalAve,col=spamLabel)
# first rule (over-fitting to capture all variation)
rule1 <- function(x){
    prediction <- rep(NA,length(x))
    prediction[x > 2.7] <- "spam"
    prediction[x < 2.40] <- "nonspam"
    prediction[(x >= 2.40 & x <= 2.45)] <- "spam"
    prediction[(x > 2.45 & x <= 2.70)] <- "nonspam"
    return(prediction)
}
# tabulate results of prediction algorithm 1 (in sample error --> no error in this case)
table(rule1(smallSpam$capitalAve),smallSpam$type)
##
## nonspam spam
## nonspam 5 0
## spam 0 5
# second rule (simple, setting a threshold)
rule2 <- function(x){
    prediction <- rep(NA,length(x))
    prediction[x > 2.8] <- "spam"
    prediction[x <= 2.8] <- "nonspam"
    return(prediction)
}
# tabulate results of prediction algorithm 2 (in sample error --> 10% in this case)
table(rule2(smallSpam$capitalAve),smallSpam$type)
##
## nonspam spam
## nonspam 5 1
## spam 0 4
# tabulate out of sample error for algorithm 1
table(rule1(spam$capitalAve),spam$type)
##
## nonspam spam
## nonspam 2141 588
## spam 647 1225
# tabulate out of sample error for algorithm 2
table(rule2(spam$capitalAve),spam$type)
##
## nonspam spam
## nonspam 2224 642
## spam 564 1171
# accuracy and total correct for algorithm 1 and 2
rbind("Rule 1" = c(Accuracy = mean(rule1(spam$capitalAve)==spam$type),
"Total Correct" = sum(rule1(spam$capitalAve)==spam$type)),
"Rule 2" = c(Accuracy = mean(rule2(spam$capitalAve)==spam$type),
"Total Correct" = sum(rule2(spam$capitalAve)==spam$type)))
## Accuracy Total Correct
## Rule 1 0.7315801 3366
## Rule 2 0.7378831 3395
* a randomly sampled test set is subsetted out from the original training set
* the predictor is built on the remaining training data and applied to the test set
* the above are three random subsamplings from the same training set
* considerations
    - sampling must be done without replacement
    - random sampling with replacement = bootstrap
        + underestimates the error
        + can be corrected, but it is complicated (0.632 Bootstrap)
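The random-subsampling idea can be sketched with base R alone — a toy index vector, no caret functions needed. Note the `sample()` call draws without replacement by default, as the considerations above require:

```r
set.seed(333)
n <- 100                                       # toy data set of 100 "rows"
trainIdx <- sample(n, size = round(0.75 * n))  # sampled WITHOUT replacement
testIdx  <- setdiff(1:n, trainIdx)             # everything not sampled
c(train = length(trainIdx), test = length(testIdx))   # 75 / 25 split
length(intersect(trainIdx, testIdx))           # 0: train and test never overlap
```

Repeating this with different seeds gives the independent random subsamplings described above.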
* break the training set into \(K\) subsets (above is a 3-fold cross validation)
* build the model/predictor on the remaining training data for each subset and apply it to the test subset
* rebuild the model \(K\) times with the different training and test subsets and average the findings
* considerations
    - larger k = less bias, more variance
    - smaller k = more bias, less variance
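A minimal hand-rolled K-fold loop (base R, K = 3, a simple lm on the built-in mtcars data, purely illustrative) shows the rebuild-and-average cycle:

```r
set.seed(333)
k <- 3
# randomly assign each of the 32 mtcars rows to one of 3 folds
fold <- sample(rep(1:k, length.out = nrow(mtcars)))
errs <- numeric(k)
for (i in 1:k) {
    train <- mtcars[fold != i, ]   # build the model on the other folds
    test  <- mtcars[fold == i, ]   # hold fold i out as the test subset
    fit   <- lm(mpg ~ wt, data = train)
    errs[i] <- mean((predict(fit, newdata = test) - test$mpg)^2)
}
mean(errs)   # average the K held-out errors
```

In practice caret's createFolds/trainControl handle this loop for you (see below).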
* leave out exactly one sample and build the predictor on the rest of the training data
* predict the value for the left-out sample
* repeat for each sample
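LOOCV is the K = n special case; an illustrative base-R sketch on mtcars (32 models, one per left-out row):

```r
n <- nrow(mtcars)
resid2 <- numeric(n)
for (i in 1:n) {
    fit  <- lm(mpg ~ wt, data = mtcars[-i, ])            # fit on all but row i
    pred <- predict(fit, newdata = mtcars[i, , drop = FALSE])
    resid2[i] <- (pred - mtcars$mpg[i])^2                # squared error for row i
}
mean(resid2)   # LOOCV estimate of the mean squared error
```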
caret Package (tutorial)
key functions: preProcess(), createDataPartition(), createResample(), createTimeSlices(), train(), predict(), confusionMatrix()
the caret package provides a uniform framework to build/predict using different models; the algorithms can all be run the same way through the predict() function
createDataPartition(y=data$var, times=1, p=0.75, list=FALSE) –> creates data partitions using the given variable
* y=data$var = specifies what outcome/variable to split the data on
* times=1 = specifies number of partitions to create (number of data splits performed)
* p=0.75 = percent of data that will be used for training the model
* list=FALSE = returns a matrix of indices corresponding to p% of the data (the training set)
    - list=FALSE is generally what is used
* list=TRUE = returns a list of indices corresponding to p% of the data (the training set)
* training <- data[inTrain, ] = subsets the data to the training set only
* testing <- data[-inTrain, ] = the rest of the data can then be stored as the test set
# load packages and data
library(caret)
# create training set indexes with 75% of data
inTrain <- createDataPartition(y=spam$type,p=0.75, list=FALSE)
# subset spam data to training
training <- spam[inTrain,]
# subset spam data (the rest) to test
testing <- spam[-inTrain,]
# dimension of original and training dataset
rbind("original dataset" = dim(spam),"training set" = dim(training))
## [,1] [,2]
## original dataset 4601 58
## training set 3451 58
createFolds(y=data$var, k=10, list=TRUE, returnTrain=TRUE) = slices the data into \(k\) folds for cross validation and returns \(k\) lists of indices
* y=data$var = specifies what outcome/variable to split the data on
* k=10 = specifies number of folds to create (see K Fold Cross Validation)
* list=TRUE = returns k lists of indices that correspond to the cross-validated sets
    - since the result is k data sets/vectors of indices, list=TRUE is generally what is used
    - with a list (folds in this case), you can use folds[[1]][1:10] to access different elements from that list
* list=FALSE = returns a vector indicating which of the k folds each data point belongs to (i.e. 1 - 10 is assigned for each of the data points in this case)
* [if list=TRUE] returnTrain=TRUE = returns the indices of the training sets
* returnTrain=FALSE = returns the indices of the test sets (the default behavior)
# create 10 folds for cross validation and return the training set indices
folds <- createFolds(y=spam$type,k=10,list=TRUE,returnTrain=TRUE)
# structure of the training set indices
str(folds)
## List of 10
## $ Fold01: int [1:4141] 1 2 3 4 6 7 8 9 10 12 ...
## $ Fold02: int [1:4140] 1 2 3 4 5 7 8 9 10 11 ...
## $ Fold03: int [1:4141] 1 3 4 5 6 7 8 9 10 11 ...
## $ Fold04: int [1:4141] 2 3 4 5 6 7 8 9 10 11 ...
## $ Fold05: int [1:4141] 1 2 3 4 5 6 7 8 9 10 ...
## $ Fold06: int [1:4141] 1 2 3 4 5 6 7 8 9 10 ...
## $ Fold07: int [1:4141] 1 2 3 4 5 6 8 9 11 12 ...
## $ Fold08: int [1:4141] 1 2 3 4 5 6 7 8 9 10 ...
## $ Fold09: int [1:4141] 1 2 4 5 6 7 10 11 12 13 ...
## $ Fold10: int [1:4141] 1 2 3 5 6 7 8 9 10 11 ...
# return the test set indices instead
# note: returnTrain = FALSE is unnecessary as it is the default behavior
folds.test <- createFolds(y=spam$type,k=10,list=TRUE,returnTrain=FALSE)
str(folds.test)
## List of 10
## $ Fold01: int [1:460] 15 16 18 40 45 62 68 81 82 102 ...
## $ Fold02: int [1:459] 1 41 55 58 67 75 117 123 151 175 ...
## $ Fold03: int [1:461] 3 14 66 69 70 80 90 112 115 135 ...
## $ Fold04: int [1:460] 5 19 25 65 71 83 85 88 91 93 ...
## $ Fold05: int [1:460] 6 10 17 21 26 56 57 104 107 116 ...
## $ Fold06: int [1:459] 7 8 13 39 52 54 76 89 99 106 ...
## $ Fold07: int [1:461] 4 23 27 29 32 33 34 38 49 51 ...
## $ Fold08: int [1:460] 2 9 30 31 36 37 43 46 47 48 ...
## $ Fold09: int [1:461] 12 20 24 44 53 59 60 64 84 98 ...
## $ Fold10: int [1:460] 11 22 28 35 42 61 72 86 92 118 ...
# return first 10 elements of the first training set
folds[[1]][1:10]
## [1] 1 2 3 4 6 7 8 9 10 12
createResample(y=data$var, times=10, list=TRUE) = creates 10 resamplings from the given data with replacement
* times=10 = number of samples to create
* list=TRUE = returns a list of n vectors that contain the indices of the sample
# create 10 resamples
resamples <- createResample(y=spam$type,times=10,list=TRUE)
# structure of the resamples (note some samples are repeated)
str(resamples)
## List of 10
## $ Resample01: int [1:4601] 1 4 4 4 7 8 12 13 13 14 ...
## $ Resample02: int [1:4601] 3 3 5 7 10 12 12 13 13 14 ...
## $ Resample03: int [1:4601] 1 2 2 3 4 5 8 10 11 12 ...
## $ Resample04: int [1:4601] 1 3 3 4 7 8 8 9 10 14 ...
## $ Resample05: int [1:4601] 2 4 5 6 7 7 8 8 9 12 ...
## $ Resample06: int [1:4601] 3 6 6 7 8 9 12 13 13 14 ...
## $ Resample07: int [1:4601] 1 2 2 5 5 6 7 8 9 10 ...
## $ Resample08: int [1:4601] 2 2 3 4 4 7 7 8 8 9 ...
## $ Resample09: int [1:4601] 1 4 7 8 8 9 12 13 15 15 ...
## $ Resample10: int [1:4601] 1 3 4 4 7 7 9 9 10 11 ...
createTimeSlices(y=data, initialWindow=20, horizon=10) = creates training sets with the specified window length and the corresponding test sets
* initialWindow=20 = number of consecutive values in each time slice/training set (i.e. values 1 - 20)
* horizon=10 = number of consecutive values in each prediction/test set (i.e. values 21 - 30)
* fixedWindow=FALSE = training sets always start at the first observation (so the training window grows)
# create time series data
tme <- 1:1000
# create time slices
folds <- createTimeSlices(y=tme,initialWindow=20,horizon=10)
# name of lists
names(folds)
## [1] "train" "test"
# first training set
folds$train[[1]]
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
# first test set
folds$test[[1]]
## [1] 21 22 23 24 25 26 27 28 29 30
train(y ~ x, data=df, method="glm") = function to apply a machine learning algorithm to construct a model from the training data
# returns the arguments of the default train function
args(train.default)
## function (x, y, method = "rf", preProcess = NULL, ..., weights = NULL,
## metric = ifelse(is.factor(y), "Accuracy", "RMSE"), maximize = ifelse(metric ==
## "RMSE", FALSE, TRUE), trControl = trainControl(), tuneGrid = NULL,
## tuneLength = 3)
## NULL
the train function has a large set of parameters; below are the default options
* method="rf" = default algorithm is random forest for training a given data set; caret contains a large number of algorithms
    - names(getModelInfo()) = returns all the options for the method argument
* preProcess=NULL = set preprocessing options (see Preprocessing)
* weights=NULL = can be used to add weights to observations, useful for unbalanced distributions (a lot more of one type than another)
* metric=ifelse(is.factor(y), "Accuracy", "RMSE") = default metric is Accuracy for factor variables, and RMSE (root mean squared error) for continuous variables
* maximize=ifelse(metric=="RMSE", FALSE, TRUE) = the algorithm should maximize Accuracy and minimize RMSE
* trControl=trainControl() = training controls for the model, more details below
* tuneGrid=NULL
* tuneLength=3
# returns the default arguments for the trainControl object
args(trainControl)
## function (method = "boot", number = ifelse(grepl("cv", method),
## 10, 25), repeats = ifelse(grepl("cv", method), 1, number),
## p = 0.75, initialWindow = NULL, horizon = 1, fixedWindow = TRUE,
## verboseIter = FALSE, returnData = TRUE, returnResamp = "final",
## savePredictions = FALSE, classProbs = FALSE, summaryFunction = defaultSummary,
## selectionFunction = "best", preProcOptions = list(thresh = 0.95,
## ICAcomp = 3, k = 5), index = NULL, indexOut = NULL, timingSamps = 0,
## predictionBounds = rep(FALSE, 2), seeds = NA, adaptive = list(min = 5,
## alpha = 0.05, method = "gls", complete = TRUE), allowParallel = TRUE)
## NULL
trainControl creates an object that sets many options for how the model will be applied to the training data
* method="boot" = resampling method
    - "boot" = bootstrapping (drawing with replacement)
    - "boot632" = bootstrapping with adjustment
    - "cv" = cross validation
    - "repeatedcv" = repeated cross validation
    - "LOOCV" = leave one out cross validation
* number=ifelse(grepl("cv", method), 10, 25) = number of subsamples to take
    - number=10 = default for any kind of cross validation
    - number=25 = default for bootstrapping
    - number should be increased when fine-tuning a model with a large number of parameters
* repeats=ifelse(grepl("cv", method), 1, number) = number of times to repeat the subsampling
    - repeats=1 = default for any cross validation method
    - repeats=25 = default for bootstrapping
* p=0.75 = default percentage of data used to create training sets
* initialWindow=NULL, horizon=1, fixedWindow=TRUE = parameters for time series data
* verboseIter=FALSE = print the training logs
* returnData=TRUE, returnResamp="final", savePredictions=FALSE = save the predictions for each resample
* classProbs=FALSE = return classification probabilities along with the predictions
* summaryFunction=defaultSummary = default summary of the model
* preProcOptions=list(thresh = 0.95, ICAcomp = 3, k = 5) = specifies preprocessing options for the model
* predictionBounds=rep(FALSE, 2) = specify the range of the predicted values
    - predictionBounds=c(10, NA) would mean that any value lower than 10 is treated as 10, with no upper bound
* seeds=NA = set the seed for each resampling iteration, which matters for reproducibility when the train function is run in parallel
* allowParallel=TRUE = enables parallel processing/computations
featurePlot(x=predictors, y=outcome, plot="pairs") = shortcut to plot the relationships between the predictors and outcomes
# load relevant libraries
library(ISLR); library(ggplot2);
# load wage data
data(Wage)
# create training and test sets
inTrain <- createDataPartition(y=Wage$wage,p=0.7, list=FALSE)
training <- Wage[inTrain,]
testing <- Wage[-inTrain,]
# plot relationships between the predictors and outcome
featurePlot(x=training[,c("age","education","jobclass")], y = training$wage,plot="pairs")
qplot(age, wage, color=education, data=training) = can be used to separate data by a factor variable (by coloring the points differently)
* geom_smooth(method = "lm") = adds a regression line to the plots
* geom=c("boxplot", "jitter") = specifies what kind of plot to produce, in this case both the boxplot and the point cloud
# qplot plus linear regression lines
qplot(age,wage,colour=education,data=training)+geom_smooth(method='lm',formula=y~x)
cut2(variable, g=3) = creates a new factor variable by cutting the specified variable into n groups (3 in this case) based on percentiles
* the cut2 function is part of the Hmisc package, so library(Hmisc) must be run first
grid.arrange(p1, p2, ncol=2) = function to print multiple ggplot2 graphs on the same plot
* the grid.arrange function is part of the gridExtra package, so library(gridExtra) must be run first
# load Hmisc and gridExtra packages
library(Hmisc);library(gridExtra);
# cut the wage variable into 3 groups
cutWage <- cut2(training$wage,g=3)
# plot the boxplot
p1 <- qplot(cutWage,age, data=training,fill=cutWage,
geom=c("boxplot"))
# plot boxplot and point clusters
p2 <- qplot(cutWage,age, data=training,fill=cutWage,
geom=c("boxplot","jitter"))
# plot the two graphs side by side
grid.arrange(p1,p2,ncol=2)
table(cutVariable, data$var2) = tabulates the cut factor variable vs another variable in the dataset
prop.table(table, margin=1) = converts a table to a proportion table
* margin=1 = calculate the proportions based on the rows
* margin=2 = calculate the proportions based on the columns
# tabulate the cutWage and jobclass variables
t <- table(cutWage,training$jobclass)
# print table
t
##
## cutWage 1. Industrial 2. Information
## [ 20.1, 92.7) 445 256
## [ 92.7,119.1) 376 347
## [119.1,318.3] 269 409
# convert to proportion table based on the rows
prop.table(t,1)
##
## cutWage 1. Industrial 2. Information
## [ 20.1, 92.7) 0.6348074 0.3651926
## [ 92.7,119.1) 0.5200553 0.4799447
## [119.1,318.3] 0.3967552 0.6032448
qplot(var1, color=var2, data=training, geom="density") = produces density plot for the given numeric and factor variables
# produce density plot
qplot(wage,colour=education,data=training,geom="density")
* the test set must be standardized using the mean and standard deviation of the training set variables
train(y~x, data=training, preProcess=c("center", "scale")) = preprocessing can be specified directly in the train function
* preProcess=c("center", "scale") = normalize all predictors before constructing the model
preProcess(trainingData, method=c("center", "scale")) = function in caret to standardize data
* store the result of the preProcess function as an object and apply it to the train and test sets using the predict function
# load spam data
data(spam)
# create train and test sets
inTrain <- createDataPartition(y=spam$type,p=0.75, list=FALSE)
training <- spam[inTrain,]
testing <- spam[-inTrain,]
# create preProcess object for all predictors ("-58" because 58th = outcome)
preObj <- preProcess(training[,-58],method=c("center","scale"))
# normalize training set
trainCapAveS <- predict(preObj,training[,-58])$capitalAve
# normalize test set using training parameters
testCapAveS <- predict(preObj,testing[,-58])$capitalAve
# compare results for capitalAve variable
rbind(train = c(mean = mean(trainCapAveS), std = sd(trainCapAveS)),
test = c(mean(testCapAveS), sd(testCapAveS)))
## mean std
## train 6.362510e-18 1.000000
## test 7.548133e-02 1.633866
preProcess(data, method="BoxCox") = applies Box-Cox transformations to continuous data to help normalize the variables through maximum likelihood
* qqnorm(processedVar) = can be used to produce a Q-Q plot, which compares the theoretical quantiles with the sample quantiles to assess the normality of the data
# load e1071 (needed by some caret preprocessing steps)
library(e1071)
# set up BoxCox transforms
preObj <- preProcess(training[,-58],method=c("BoxCox"))
# perform preprocessing on training data
trainCapAveS <- predict(preObj,training[,-58])$capitalAve
# plot histogram and QQ Plot
# Note: the transformation definitely helped to
# normalize the data, but it does not produce a perfect result
par(mfrow=c(1,2)); hist(trainCapAveS); qqnorm(trainCapAveS)
preProcess(data, method="knnImpute") = impute/estimate missing data using k nearest neighbors (knn) imputation
* knnImpute = takes the k nearest neighbors of the missing value and averages their values to impute the missing observation
# Make some values NA
training$capAve <- training$capitalAve
selectNA <- rbinom(dim(training)[1],size=1,prob=0.05)==1
training$capAve[selectNA] <- NA
# Impute and standardize
preObj <- preProcess(training[,-58],method="knnImpute")
capAve <- predict(preObj,training[,-58])$capAve
# Standardize true values
capAveTruth <- training$capitalAve
capAveTruth <- (capAveTruth-mean(capAveTruth))/sd(capAveTruth)
# compute differences between imputed values and true values
quantile(capAve - capAveTruth)
## 0% 25% 50% 75% 100%
## -1.8242434686 -0.0005069839 0.0007532392 0.0013598891 0.4062295931
preProcess() can be leveraged to handle creating new covariates
dummyVars(outcome~var, data=training) = creates a dummy variable object that can be used through the predict function to create dummy variables
* predict(dummyObj, newdata=training) = creates the appropriate columns to represent the factor variable with the appropriate 0s and 1s
# setting up data
inTrain <- createDataPartition(y=Wage$wage,p=0.7, list=FALSE)
training <- Wage[inTrain,]; testing <- Wage[-inTrain,]
# create a dummy variable object
dummies <- dummyVars(wage ~ jobclass,data=training)
# create the dummy variable columns
head(predict(dummies,newdata=training))
## jobclass.1. Industrial jobclass.2. Information
## 231655 1 0
## 86582 0 1
## 11443 0 1
## 305706 1 0
## 160191 1 0
## 448410 1 0
nearZeroVar(training, saveMetrics=TRUE) = returns a table of the variables in the training data set with information on frequency ratios, percent unique values, and whether or not each variable has zero/near-zero variance
* freqRatio = ratio of the frequency of the most common value over the second most common value
* percentUnique = percentage of unique data points out of the total number of data points
* zeroVar = TRUE/FALSE indicating whether the predictor has only one distinct value
* nzv = TRUE/FALSE indicating whether the predictor is a near zero variance predictor
    - when nzv = TRUE, those variables should be thrown out
# print nearZeroVar table
nearZeroVar(training,saveMetrics=TRUE)
## freqRatio percentUnique zeroVar nzv
## year 1.017647 0.33301618 FALSE FALSE
## age 1.231884 2.85442436 FALSE FALSE
## sex 0.000000 0.04757374 TRUE TRUE
## maritl 3.329571 0.23786870 FALSE FALSE
## race 8.480583 0.19029496 FALSE FALSE
## education 1.393750 0.23786870 FALSE FALSE
## region 0.000000 0.04757374 TRUE TRUE
## jobclass 1.070936 0.09514748 FALSE FALSE
## health 2.526846 0.09514748 FALSE FALSE
## health_ins 2.209160 0.09514748 FALSE FALSE
## logwage 1.011765 18.83920076 FALSE FALSE
## wage 1.011765 18.83920076 FALSE FALSE
[splines package] bs(data$var, df=3) = creates 3 new columns corresponding to the \(var\), \(var^2\), and \(var^3\) terms
* ns() and poly() can also be used to generate polynomials
* the gam() function can also be used; it allows smoothing of multiple variables with different values for each variable
* the same polynomial terms must be created for the test set through the predict function
# load splines package
library(splines)
# create polynomial function
bsBasis <- bs(training$age,df=3)
# fit the outcome on the three polynomial terms
lm1 <- lm(wage ~ bsBasis,data=training)
# plot all age vs wage data
plot(training$age,training$wage,pch=19,cex=0.5)
# plot the fitted polynomial function
points(training$age,predict(lm1,newdata=training),col="red",pch=19,cex=0.5)
# predict on test values
# (note: predict.bs expects a newx= argument, so age= is silently ignored here)
head(predict(bsBasis,age=testing$age))
## 1 2 3
## [1,] 0.00000000 0.00000000 0.000000000
## [2,] 0.23685006 0.02537679 0.000906314
## [3,] 0.36252559 0.38669397 0.137491189
## [4,] 0.42616898 0.14823269 0.017186399
## [5,] 0.44221829 0.19539878 0.028779665
## [6,] 0.01793746 0.20448709 0.777050955
many operations in the caret package are computationally intensive, so the doMC package is recommended for caret computations (reference)
* doMC::registerDoMC(cores=4) = registers 4 cores for R to utilize
prcomp Function
* pr <- prcomp(data) = performs PCA on all variables and returns a prcomp object that contains information about standard deviations and rotations
* pr$rotation = returns the eigenvectors for the linear combinations of all variables (the coefficients that the variables are multiplied by to come up with the principal components) –> how the principal components are created
* it often helps to log transform the variables (adding 1 first to avoid taking the log of 0) before performing PCA
* plot(pr) = plots the percent variation explained by the first 10 principal components (PC)
# load spam data
data(spam)
# perform PCA on dataset
prComp <- prcomp(log10(spam[,-58]+1))
# print out the eigenvector/rotations first 5 rows and PCs
head(prComp$rotation[, 1:5], 5)
## PC1 PC2 PC3 PC4 PC5
## make 0.019370409 0.0427855959 -0.01631961 0.02798232 -0.014903314
## address 0.010827343 0.0408943785 0.07074906 -0.01407049 0.037237531
## all 0.040923168 0.0825569578 -0.03603222 0.04563653 0.001222215
## num3d 0.006486834 -0.0001333549 0.01234374 -0.01005991 -0.001282330
## our 0.036963221 0.0941456085 -0.01871090 0.05098463 -0.010582039
# create new variable that marks spam as 2 and nonspam as 1
typeColor <- ((spam$type=="spam")*1 + 1)
# plot the first two principal components
plot(prComp$x[,1],prComp$x[,2],col=typeColor,xlab="PC1",ylab="PC2")
caret Package
pp <- preProcess(log10(training[,-58]+1), method="pca", pcaComp=2, thresh=0.8) = perform PCA with the preProcess function, which returns the number of principal components that can capture the majority of the variation
* creates a preProcess object that can be applied using the predict function
* pcaComp=2 = specifies the number of principal components to compute (2 in this case)
* thresh=0.8 = threshold for variation captured by principal components
    - thresh=0.95 = default value, which returns the number of principal components needed to capture 95% of the variation in the data
* predict(pp, training) = computes the new variables for the PCs (2 in this case) for the training data set
    - the result of predict can then be used as data for the prediction model
# create train and test sets
inTrain <- createDataPartition(y=spam$type,p=0.75, list=FALSE)
training <- spam[inTrain,]
testing <- spam[-inTrain,]
# create preprocess object
preProc <- preProcess(log10(training[,-58]+1),method="pca",pcaComp=2)
# calculate PCs for training data
trainPC <- predict(preProc,log10(training[,-58]+1))
# run model on outcome and principal components
modelFit <- train(training$type ~ .,method="glm",data=trainPC)
# calculate PCs for test data
testPC <- predict(preProc,log10(testing[,-58]+1))
# compare results
confusionMatrix(testing$type,predict(modelFit,testPC))
## Confusion Matrix and Statistics
##
## Reference
## Prediction nonspam spam
## nonspam 656 41
## spam 82 371
##
## Accuracy : 0.893
## 95% CI : (0.8737, 0.9103)
## No Information Rate : 0.6417
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7724
## Mcnemar's Test P-Value : 0.0003101
##
## Sensitivity : 0.8889
## Specificity : 0.9005
## Pos Pred Value : 0.9412
## Neg Pred Value : 0.8190
## Prevalence : 0.6417
## Detection Rate : 0.5704
## Detection Prevalence : 0.6061
## Balanced Accuracy : 0.8947
##
## 'Positive' Class : nonspam
##
train method
* train(outcome ~ ., method="glm", preProcess="pca", data=training) = performs PCA first on the training set and then runs the specified model
* more compact, but offers less control than the separate preProcess –> predict approach
# construct model
modelFit <- train(training$type ~ .,method="glm",preProcess="pca",data=training)
# print results of model
confusionMatrix(testing$type,predict(modelFit,testing))
## Confusion Matrix and Statistics
##
## Reference
## Prediction nonspam spam
## nonspam 668 29
## spam 59 394
##
## Accuracy : 0.9235
## 95% CI : (0.9066, 0.9382)
## No Information Rate : 0.6322
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8379
## Mcnemar's Test P-Value : 0.001992
##
## Sensitivity : 0.9188
## Specificity : 0.9314
## Pos Pred Value : 0.9584
## Neg Pred Value : 0.8698
## Prevalence : 0.6322
## Detection Rate : 0.5809
## Detection Prevalence : 0.6061
## Balanced Accuracy : 0.9251
##
## 'Positive' Class : nonspam
##
lm <- lm(y ~ x, data=train) = runs a linear model of outcome y on predictor x –> univariate regression
* summary(lm) = returns a summary of the linear regression model, which includes coefficients, standard errors, \(t\) statistics, and p-values
* lm(y ~ x1+x2+x3, data=train) = run linear model of outcome y on predictors x1, x2, and x3
* lm(y ~ ., data=train) = run linear model of outcome y on all predictors
* predict(lm, newdata=df) = use the constructed linear model to predict outcomes (\(\hat Y_i\)) for the new values
    - the newdata data frame must have the same variables (factors must have the same levels) as the training data
    - newdata=test = predict outcomes for the test set based on the linear regression model from the training set
# load data
data(faithful)
# create train and test sets
inTrain <- createDataPartition(y=faithful$waiting, p=0.5, list=FALSE)
trainFaith <- faithful[inTrain,]; testFaith <- faithful[-inTrain,]
# build linear model
lm1 <- lm(eruptions ~ waiting,data=trainFaith)
# print summary of linear model
summary(lm1)
##
## Call:
## lm(formula = eruptions ~ waiting, data = trainFaith)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.30541 -0.35970 0.04711 0.37624 1.07683
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.870699 0.225663 -8.29 1.01e-13 ***
## waiting 0.075674 0.003137 24.12 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5024 on 135 degrees of freedom
## Multiple R-squared: 0.8117, Adjusted R-squared: 0.8103
## F-statistic: 581.8 on 1 and 135 DF, p-value: < 2.2e-16
# predict eruptions for new waiting time
newdata <- data.frame(waiting=80)
predict(lm1,newdata)
## 1
## 4.183192
# create 1 x 2 panel plot
par(mfrow=c(1,2))
# plot train data with the regression line
plot(trainFaith$waiting,trainFaith$eruptions,pch=19,col="blue",xlab="Waiting",
ylab="Duration", main = "Train")
lines(trainFaith$waiting,predict(lm1),lwd=3)
# plot test data with the regression line
plot(testFaith$waiting,testFaith$eruptions,pch=19,col="blue",xlab="Waiting",
ylab="Duration", main = "Test")
lines(testFaith$waiting,predict(lm1,newdata=testFaith),lwd=3)
# calculate the root sums of squared errors on the training and test sets
# (note: these are not divided by n, so they are larger than a conventional RMSE)
c(trainRMSE = sqrt(sum((lm1$fitted-trainFaith$eruptions)^2)),
testRMSE = sqrt(sum((predict(lm1,newdata=testFaith)-testFaith$eruptions)^2)))
## trainRMSE testRMSE
## 5.836980 5.701161
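The printed values are square roots of summed squared errors; a conventional RMSE averages the squared errors before taking the root. A quick comparison on hypothetical (made-up) numbers:

```r
obs  <- c(3.2, 4.1, 2.8, 4.6)   # hypothetical observed values
pred <- c(3.0, 4.4, 3.1, 4.2)   # hypothetical predictions
rootSS <- sqrt(sum((pred - obs)^2))    # what the chunk above computes
rmse   <- sqrt(mean((pred - obs)^2))   # conventional RMSE (divides by n first)
c(rootSS = rootSS, rmse = rmse)        # rootSS = sqrt(n) * rmse
```

Both are fine for comparing two models on the same data; only the RMSE is comparable across data sets of different sizes.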
pi <- predict(lm, newdata=test, interval="prediction") = returns 3 columns: fit (predicted value, same as before), lwr (lower bound of the prediction interval), and upr (upper bound of the prediction interval)
matlines(x, pi, type="l") = plots three lines, one for the linear fit and two for the upper/lower prediction interval bounds
# calculate prediction interval
pred1 <- predict(lm1,newdata=testFaith,interval="prediction")
# plot data points (eruptions, waiting)
plot(testFaith$waiting,testFaith$eruptions,pch=19,col="blue")
# plot fit line and prediction interval
matlines(testFaith$waiting,pred1,type="l",col=c(1,2,2),lty=c(1,1,1),lwd=3)
lm <- train(y ~ x, method="lm", data=train) = run linear model on the training data –> identical to the lm function
* summary(lm$finalModel) = returns a summary of the linear regression model, which includes coefficients, standard errors, \(t\) statistics, and p-values –> identical to summary(lm) for an lm object
* train(y ~ ., method="lm", data=train) = run linear model on all predictors in the training data
* plot(lm$finalModel) = constructs 4 diagnostic plots for evaluating the model
    - see ?plot.lm for more detail
# create train and test sets
inTrain <- createDataPartition(y=Wage$wage,p=0.7, list=FALSE)
training <- Wage[inTrain,]; testing <- Wage[-inTrain,]
# fit linear model for age jobclass and education
modFit<- train(wage ~ age + jobclass + education,method = "lm",data=training)
# store final model
finMod <- modFit$finalModel
# set up 2 x 2 panel plot
par(mfrow = c(2, 2))
# construct diagnostic plots for model
plot(finMod,pch=19,cex=0.5,col="#00000010")
plot(finMod$residuals) = plot the residuals against index (row number)
# plot residual by index
plot(finMod$residuals,pch=19,cex=0.5)
* here the residuals increase linearly with the index, and the highest residuals are concentrated at the higher indexes, which suggests a variable missing from the model (e.g. the rows are ordered by some variable that is not used as a predictor)
the party, rpart, and tree packages can all build trees
within each group (leaf) \(m\), the proportion of class \(k\) is estimated as \[\hat{p}_{mk} = \frac{1}{N_m}\sum_{x_i \in R_m} \mathbb{1}(y_i = k)\]
* \(N_m\) is the size of the group
example
# set margin and seed
par(mar=c(1,1,1,1), mfrow = c(1, 2)); set.seed(1234);
# simulate data
x = rep(1:4,each=4); y = rep(1:4,4)
# plot first scenario
plot(x,y,xaxt="n",yaxt="n",cex=3,col=c(rep("blue",15),rep("red",1)),pch=19)
# plot second scenario
plot(x,y,xaxt="n",yaxt="n",cex=3,col=c(rep("blue",8),rep("red",8)),pch=19)
* left graph
    - Misclassification: \(\frac{1}{16} = 0.06\)
    - Gini: \(1 - [(\frac{1}{16})^2 + (\frac{15}{16})^2] = 0.12\)
    - Information: \(-[\frac{1}{16} \times \log_2 (\frac{1}{16}) + \frac{15}{16} \times \log_2(\frac{15}{16})] = 0.34\)
* right graph
    - Misclassification: \(\frac{8}{16} = 0.5\)
    - Gini: \(1 - [(\frac{8}{16})^2 + (\frac{8}{16})^2] = 0.5\)
    - Information: \(-[\frac{8}{16} \times \log_2 (\frac{8}{16}) + \frac{8}{16} \times \log_2(\frac{8}{16})] = 1\)
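These numbers can be verified with small throwaway helper functions (the names misclass/gini/info below are just illustrative, not from any package):

```r
misclass <- function(p) 1 - max(p)         # misclassification error
gini     <- function(p) 1 - sum(p^2)       # Gini index
info     <- function(p) -sum(p * log2(p))  # information/deviance (base-2 entropy)
left  <- c(1/16, 15/16)   # 1 red point, 15 blue points
right <- c(8/16, 8/16)    # 8 red points, 8 blue points
round(c(misclass(left),  gini(left),  info(left)),  2)   # 0.06 0.12 0.34
round(c(misclass(right), gini(right), info(right)), 2)   # 0.50 0.50 1.00
```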
caret Package
tree <- train(y ~ ., data=train, method="rpart") = constructs trees based on the outcome and predictors
* the final model is an rpart object, which can be used to predict new/test values
* print(tree$finalModel) = returns a text summary of all nodes/splits in the constructed tree
* plot(tree$finalModel, uniform=TRUE) = plots the classification tree with all nodes/splits
* [rattle package] fancyRpartPlot(tree$finalModel) = produces a more readable, better formatted classification tree diagram
# load rattle and rpart packages
library(rattle)
library(rpart)
# load iris data set
data(iris)
# create test/train data sets
inTrain <- createDataPartition(y=iris$Species,p=0.7, list=FALSE)
training <- iris[inTrain,]
testing <- iris[-inTrain,]
# fit classification tree as a model
modFit <- train(Species ~ .,method="rpart",data=training)
# print the classification tree
print(modFit$finalModel)
## n= 105
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 105 70 setosa (0.33333333 0.33333333 0.33333333)
## 2) Petal.Length< 2.45 35 0 setosa (1.00000000 0.00000000 0.00000000) *
## 3) Petal.Length>=2.45 70 35 versicolor (0.00000000 0.50000000 0.50000000)
## 6) Petal.Width< 1.65 34 1 versicolor (0.00000000 0.97058824 0.02941176) *
## 7) Petal.Width>=1.65 36 2 virginica (0.00000000 0.05555556 0.94444444) *
# plot the classification tree
rattle::fancyRpartPlot(modFit$finalModel)
# predict on test values
predict(modFit,newdata=testing)
## [1] setosa setosa setosa setosa setosa setosa
## [7] setosa setosa setosa setosa setosa setosa
## [13] setosa setosa setosa versicolor versicolor versicolor
## [19] versicolor versicolor versicolor versicolor versicolor versicolor
## [25] versicolor versicolor versicolor versicolor versicolor versicolor
## [31] virginica virginica virginica virginica virginica virginica
## [37] versicolor virginica virginica versicolor versicolor virginica
## [43] virginica virginica virginica
## Levels: setosa versicolor virginica
* bagging is most useful for non-linear models
loess(y ~ x, data=train, span=0.2) = fits a smooth curve to the data
* span=0.2 = controls how smooth the curve should be
# load data
library(ElemStatLearn); data(ozone,package="ElemStatLearn")
# reorder rows based on ozone variable
ozone <- ozone[order(ozone$ozone),]
# create empty matrix
ll <- matrix(NA,nrow=10,ncol=155)
# iterate 10 times
for(i in 1:10){
# create sample from data with replacement
ss <- sample(1:dim(ozone)[1],replace=T)
# draw sample from the data and reorder rows based on ozone
ozone0 <- ozone[ss,]; ozone0 <- ozone0[order(ozone0$ozone),]
# fit loess function through data (similar to spline)
loess0 <- loess(temperature ~ ozone,data=ozone0,span=0.2)
# prediction from loess curve for the same values each time
ll[i,] <- predict(loess0,newdata=data.frame(ozone=1:155))
}
# plot the data points
plot(ozone$ozone,ozone$temperature,pch=19,cex=0.5)
# plot each prediction model
for(i in 1:10){lines(1:155,ll[i,],col="grey",lwd=2)}
# plot the average in red
lines(1:155,apply(ll,2,mean),col="red",lwd=2)
In the caret package, there are three options for the train function to perform bagging:
* `bagEarth` - Bagged MARS
* `treebag` - Bagged CART
* `bagFDA` - Bagged Flexible Discriminant Analysis

Custom bag functions can also be constructed:
* `bag(predictors, outcome, B=10, bagControl(fit, predict, aggregate))` = defines and executes a custom bagging algorithm
    - `B=10` = number of iterations/resamplings to perform
    - `bagControl()` = controls how the bagging should be executed
        - `fit=ctreeBag$fit` = the model run on each resampling of the data
        - `predict=ctreeBag$pred` = how predictions should be calculated from each model
        - `aggregate=ctreeBag$aggregate` = how the prediction models should be combined/averaged
# load relevant package and data
library(party); data(ozone,package="ElemStatLearn")
# reorder rows based on ozone variable
ozone <- ozone[order(ozone$ozone),]
# extract predictors
predictors <- data.frame(ozone=ozone$ozone)
# extract outcome
temperature <- ozone$temperature
# run bagging algorithm
treebag <- bag(predictors, temperature, B = 10,
# custom bagging function
bagControl = bagControl(fit = ctreeBag$fit,
predict = ctreeBag$pred,
aggregate = ctreeBag$aggregate))
# plot data points
plot(ozone$ozone,temperature,col='lightgrey',pch=19)
# plot the first fit
points(ozone$ozone,predict(treebag$fits[[1]]$fit,predictors),pch=19,col="red")
# plot the aggregated predictions
points(ozone$ozone,predict(treebag,predictors),pch=19,col="blue")
* process - bootstrap samples from training data (with replacement) - at each split, sample a random subset of the candidate variables - grow trees (repeating the split/sample steps) and vote/average the final trees
* `rf <- train(outcome ~ ., data=train, method="rf", prox=TRUE, ntree=500)` = runs the random forest algorithm on the training data against all predictors
    - `prox=TRUE` = calculates the proximity measures between observations (used in functions such as `classCenter()` to find the center of groups)
        - `rf$finalModel$prox` = returns the matrix of proximities
    - `ntree=500` = specifies the number of trees that should be constructed
    - `do.trace=TRUE` = prints logs as the trees are being built, useful for indicating progress to the user
* the `randomForest()` function can be used to perform the random forest algorithm directly (syntax is the same as `train`) and is much faster
* `getTree(rf$finalModel, k=2)` = returns a specific tree from the random forest model
* `classCenter(predictors, outcome, proximity, nNbr)` = computes the cluster centers using the `nNbr` nearest neighbors of the observations
    - `proximity = rf$finalModel$prox` = proximity matrix from the random forest model
    - `nNbr` = number of nearest neighbors that should be used to compute the cluster centers
* `predict(rf, test)` = applies the random forest model to the test data set
* `confusionMatrix(predictions, actualOutcome)` = tabulates the predictions of the model against the truth
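The bootstrap-and-vote process can be sketched manually with base R plus `rpart` (a simplified illustration only: a real random forest also samples a random subset of variables at every split, which plain `rpart` does not do):

```r
library(rpart)
data(iris); set.seed(125)
B <- 25                                       # number of bootstrapped trees
votes <- matrix(NA_character_, nrow = nrow(iris), ncol = B)
for (b in 1:B) {
  # bootstrap sample of the rows (with replacement)
  idx  <- sample(nrow(iris), replace = TRUE)
  # grow a classification tree on the bootstrap sample
  tree <- rpart(Species ~ ., data = iris[idx, ], method = "class")
  # record this tree's vote for every observation
  votes[, b] <- as.character(predict(tree, newdata = iris, type = "class"))
}
# majority vote across the B trees
pred <- apply(votes, 1, function(v) names(which.max(table(v))))
mean(pred == iris$Species)                    # ensemble training accuracy
```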
library(randomForest)
# load data
data(iris)
# create train/test data sets
inTrain <- createDataPartition(y=iris$Species,p=0.7, list=FALSE)
training <- iris[inTrain,]
testing <- iris[-inTrain,]
# apply random forest
modFit <- train(Species~ .,data=training,method="rf",prox=TRUE)
# return the second tree (first 6 rows)
head(getTree(modFit$finalModel,k=2))
## left daughter right daughter split var split point status prediction
## 1 2 3 3 2.60 1 0
## 2 0 0 0 0.00 -1 1
## 3 4 5 4 1.65 1 0
## 4 6 7 3 5.25 1 0
## 5 0 0 0 0.00 -1 3
## 6 0 0 0 0.00 -1 2
# compute cluster centers
irisP <- classCenter(training[,c(3,4)], training$Species, modFit$finalModel$prox)
# convert irisP to data frame and add Species column
irisP <- as.data.frame(irisP); irisP$Species <- rownames(irisP)
# plot data points
p <- qplot(Petal.Width, Petal.Length, col=Species,data=training)
# add the cluster centers
p + geom_point(aes(x=Petal.Width,y=Petal.Length,col=Species),size=5,shape=4,data=irisP)
# predict outcome for test data set using the random forest model
pred <- predict(modFit,testing)
# logic value for whether or not the rf algorithm predicted correctly
testing$predRight <- pred==testing$Species
# tabulate results
table(pred,testing$Species)
##
## pred setosa versicolor virginica
## setosa 15 0 0
## versicolor 0 13 1
## virginica 0 2 14
# plot data points with the incorrect classification highlighted
qplot(Petal.Width,Petal.Length,colour=predRight,data=testing,main="newdata Predictions")
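The overall accuracy can be read straight off the table above, since the correct predictions sit on the diagonal (a quick check using the counts shown):

```r
# confusion counts from the table above (rows = predicted, columns = actual)
tab <- matrix(c(15,  0,  0,
                 0, 13,  2,
                 0,  1, 14), nrow = 3,
              dimnames = list(pred   = c("setosa", "versicolor", "virginica"),
                              actual = c("setosa", "versicolor", "virginica")))
# correct predictions are on the diagonal
sum(diag(tab)) / sum(tab)    # overall accuracy: 42/45, about 0.933
```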
a more detailed tutorial can be found here
example
* we start with a space of blue + and red - points, and the goal is to classify all the objects correctly
* only straight lines will be used for classification
* from the above, we can see that a group of weak predictors (lines in this case) can be combined and weighted to become a much stronger predictor
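The idea of combining and weighting weak predictors can be sketched with AdaBoost on decision stumps (a toy base-R illustration; the 1-D data and the stump learner here are made up for the example, not part of the lecture):

```r
set.seed(1)
# toy 1-D data: class +1 on the outer thirds of [0,1], class -1 in the middle
x <- runif(100)
y <- ifelse(x < 0.3 | x > 0.7, 1, -1)
w <- rep(1/100, 100)                      # start with uniform observation weights
stumps <- list(); alpha <- numeric(0)
for (m in 1:20) {
  # weak learner: the single best threshold ("straight line") under current weights
  best <- NULL; best_err <- Inf
  for (thr in x) for (s in c(1, -1)) {
    pred <- ifelse(x < thr, s, -s)
    err  <- sum(w[pred != y])
    if (err < best_err) { best_err <- err; best <- list(thr = thr, s = s) }
  }
  a    <- 0.5 * log((1 - best_err) / best_err)   # weight of this weak learner
  pred <- ifelse(x < best$thr, best$s, -best$s)
  w    <- w * exp(-a * y * pred); w <- w / sum(w) # upweight the misclassified points
  stumps[[m]] <- best; alpha[m] <- a
}
# final prediction: sign of the weighted vote over all weak learners
score <- rowSums(sapply(seq_along(stumps), function(m)
  alpha[m] * ifelse(x < stumps[[m]]$thr, stumps[[m]]$s, -stumps[[m]]$s)))
mean(sign(score) == y)   # ensemble training accuracy (a single stump manages only ~0.7)
```

Each round fits the best single-threshold classifier on the reweighted data, then increases the weight of the points it got wrong, so later stumps focus on the hard cases; the final classifier is the weighted vote of all of them.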